Thinking model disabled assistant prefill #15404
Conversation
common/chat.cpp
Outdated
```cpp
        return COMMON_REASONING_FORMAT_DEEPSEEK;
    } else if (format == "deepseek-legacy") {
        return COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY;
    } else if (format == "granite") {
```
I just noticed that this was missing, so figured I'd add it (not strictly required for this PR)
We would recommend using `auto` instead of adding a new type here (so the existing enum `COMMON_REASONING_FORMAT_GRANITE` should be removed altogether - it is not actively used anywhere in the code base).

`reasoning_format` is supposed to be used for selecting the output format, i.e. the JSON response format from server to client. It should have no effect on the content.

We should document this better, to prevent future contributors from thinking that `reasoning_format` means "template format", when it should be "server response schema".
Thanks! I'll yank this out.

For the life of me, I couldn't find where `"auto"` was actually being used. I found several places where there was a special case for `_NONE` or `_DEEPSEEK_LEGACY`, but I couldn't find `_AUTO` anywhere other than being set as the default. Can you clarify how this gets used to auto-determine the right reasoning parser?
The long story is that the `message.reasoning_content` API response format was initially introduced by DeepSeek's hosted API, hence the name `deepseek`. This enum was added so that, in case OpenAI wanted their own API schema, we could select between either `deepseek` or `openai`. However, the meaning was lost over time due to a combination of bad naming and poor documentation.

`auto` was added as a no-brainer solution, since DeepSeek's `reasoning_content` is now supported by many applications, while OAI migrated to their new Responses API.

This `reasoning_format` solely determines the API schema; it is unrelated to the notion of a parser or anything at the chat template layer. The parser is determined by the chat template itself.
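To make the distinction concrete, here is a small illustration (a sketch, not code from this PR; the `reasoning_content` field follows the `deepseek` schema described above, and the `<think>` tags are just an example of raw inline output) of how the same assistant turn serializes under the two schemas:

```cpp
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
    // reasoning_format = deepseek: reasoning is returned in a dedicated
    // message.reasoning_content field, separate from the visible content
    nlohmann::json deepseek_style = {
        {"role", "assistant"},
        {"reasoning_content", "The user is asking whether I can make images..."},
        {"content", "No, I can only generate text."},
    };

    // reasoning_format = none: the raw model output passes through, so any
    // <think>...</think> block stays inline in content
    nlohmann::json none_style = {
        {"role", "assistant"},
        {"content", "<think>The user is asking whether I can make images...</think>No, I can only generate text."},
    };

    std::cout << deepseek_style.dump(2) << "\n" << none_style.dump(2) << std::endl;
    return 0;
}
```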
Thanks, that's very helpful context!
tools/server/server.cpp
Outdated
```cpp
    // 2. The chat template supports it
    bool enable_thinking = params_base.reasoning_budget != 0;
    if (enable_thinking && params_base.use_jinja) {
        common_chat_templates_inputs dummy_inputs;
```
This logic could move into a free function in `chat.*` (something like `common_chat_supports_enable_thinking`).

Also, an alternate implementation of this would be to be more explicit and do some kind of a `switch` on the subtype of `common_chat_params`, but that felt hard to maintain (similar to other places where an arch enum is required). Since this is a boot-time operation, it feels OK to do the extra template expansion to avoid another place where a developer would need to poke in architecture-specific logic.
Tried the following: a request containing …

EDIT: I believe I am still getting …
Thanks for testing it @arichiardi. Any chance you can post both the command to launch the server and the full request?
@ryan-mangeno if you have any cycles for testing, I'd love some help putting this one through its paces.
@gabe-l-hart this is my call (copied from …):
Ok, looking at this request, it doesn't look like this is actually triggering the condition this PR is targeting, since it does not end with an assistant message. I modified the above request to end with an assistant message and to set `enable_thinking` in `chat_template_kwargs`:

```sh
# Incorrectly sent as a string -> error
curl -XPOST -d'{"model":"UNUSED", "messages":[{"role":"user","content":"Can you make images?"}, {"role":"assistant","content":"Nope, I sure can not"}],"stream":false,"temperature":0.0,"max_tokens":30,"chat_template_kwargs":{"enable_thinking":"false"}}' -H "Content-Type: application/json" http://localhost:8081/v1/chat/completions

# Correctly sent as a bool -> ok
curl -XPOST -d'{"model":"UNUSED", "messages":[{"role":"user","content":"Can you make images?"}, {"role":"assistant","content":"Nope, I sure can not"}],"stream":false,"temperature":0.0,"max_tokens":30,"chat_template_kwargs":{"enable_thinking":false}}' -H "Content-Type: application/json" http://localhost:8081/v1/chat/completions
```

This behavior is a bit confusing given that the code is parsing the value as a string and explicitly checking against the string literal …
Ok! I think I've figured out my issue and it comes down to misusing …
I was not quite right before. This opens up a different question: what is the correct way to handle a user sending an invalid type in a chat template kwarg? This seems ill-defined, since chat template kwargs are model-specific except for `enable_thinking`. I see two options:

1. Fail fast and reject the request with an error
2. Do a "best effort" coercion of the value

My usual inclination is to do "best effort" (2) in this case, but that often leads to loosey-goosey APIs, so the more correct thing to do would be (1). @ngxson any thoughts on this one?
I'd probably go for (2) myself. Additionally, if the param requires a …

Agree that leaving the value unset without any sort of feedback is sub-optimal.
I've implemented the fail-fast path (reject any string value in the `enable_thinking` kwarg). Right now, any errors raised in … come back as a 500.
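For illustration, a minimal sketch of what that fail-fast check could look like (the helper name `parse_enable_thinking` and its wiring are assumptions, not the PR's actual code; nlohmann::json is the JSON library the server already uses):

```cpp
#include <stdexcept>
#include <nlohmann/json.hpp>

// Sketch only: resolve enable_thinking from chat_template_kwargs, rejecting
// strings outright since "truthy"/"falsy" strings are too ambiguous to coerce.
static bool parse_enable_thinking(const nlohmann::json & kwargs, bool default_value) {
    if (!kwargs.contains("enable_thinking")) {
        return default_value; // not set: keep the server-side default
    }
    const auto & val = kwargs.at("enable_thinking");
    if (val.is_boolean()) {
        return val.get<bool>(); // an explicit kwarg overrides the default
    }
    // fail fast on "false", "true", or any other non-boolean value
    throw std::runtime_error("\"enable_thinking\" must be a boolean, got: " + val.dump());
}
```

Under this shape, the first curl above (string `"false"`) is rejected, while the second (bool `false`) cleanly disables thinking.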
It looks like more-nuanced error types in …

I think (1) would be much better, as it would be extensible to other methods within …
feat: Set enable_thinking IFF not disabled and supported

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <[email protected]>

fix: Fix inverted logic condition for prefill error

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <[email protected]>

fix: Always parse the enable_thinking kwarg to overwrite the default value

From what I can tell, this started as a Qwen3-specific keyword, but since the code in `chat.cpp` translates `inputs.enable_thinking` to the right thinking kwarg for the given model, this is now more of a standardized kwarg, so it should always override the default value when sent as part of the chat_template_kwargs field in the API.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <[email protected]>

fix: Don't limit template expansion check to jinja

With the use_jinja check, non-jinja models would enable thinking and always fail assistant prefill.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <[email protected]>

feat: Add the error text to json type errors in json_value

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <[email protected]>

feat: Explicitly reject string values for "enable_thinking"

There are too many possible "truthy" / "falsy" strings and too many ambiguous strings that don't have a clear truthy/falsy value, so the simplest thing to do here is to reject the request. Ideally, this would be a 422 (Unprocessable Entity), but right now it's coming back as a 500.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <[email protected]>
EDIT: never mind, was using the wrong binary 😄
Merge remote-tracking branch 'origin/master' into gabe-l-hart/thinking-model-disabled-agent-prefill

* origin/master: (76 commits)
  scripts: add sqlite3 check for compare-commits.sh (ggml-org#15633)
  kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)
  gguf-py: byteswapping improvements (ggml-org#12851)
  cli : change log to warning to explain reason for stopping (ggml-org#15604)
  model-conversion : add mmproj conversion target (ggml-org#15628)
  cuda: Add cublasLt_static linking when GGML_STATIC is enabled (ggml-org#15622)
  server: higher timeout for tests (ggml-org#15621)
  presets : add qwen3-30B-a3b FIM (ggml-org#15616)
  HIP: Enable support for ggml_backend_cuda_register_host_buffer (ggml-org#15615)
  kv-cache : better estimate of n_kv for multi-sequence batches (ggml-org#15610)
  CANN: refactor mask handling and improve performance in FA (ggml-org#15561)
  ggml-cpu : add basic RVV support for vector f32 ops (ggml-org#15057)
  common : add -m to bash completion for --model [no ci] (ggml-org#15591)
  OpenCL: add fused group_norm/norm, mul, add (ggml-org#15314)
  tests : fix test-opt with GGML_BACKEND_DL (ggml-org#15599)
  SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (ggml-org#15592)
  mtmd : fix mtmd ios build (ggml-org#15579)
  tests: add performance test for mul mat id (ggml-org#15543)
  llamafile: PowerPC Sgemm Optimization (ggml-org#15558)
  graph : fix assert in memory-less build_attn (ggml-org#15590)
  ...
@ngxson any progress on this review? I'm also happy for anyone else to look it over if you're over-full.

@ggerganov any chance you can look this one over?
tools/server/server.cpp
Outdated
```cpp
    const auto rendered_no_thinking = common_chat_templates_apply(chat_templates.get(), dummy_inputs);
    dummy_inputs.enable_thinking = true;
    const auto rendered_with_thinking = common_chat_templates_apply(chat_templates.get(), dummy_inputs);
    enable_thinking = rendered_no_thinking.prompt != rendered_with_thinking.prompt;
```
I don't understand the logic of this check - seems we would disable thinking if the 2 rendered prompts are different. Why would that be?
Nvm, got it. This should be moved to a helper function.
I'll double check in a few, but I think the logic as written is that if the two prompts are different (with and without thinking), the `enable_thinking` bool is set to `true`. If they match, the assumption is that the `enable_thinking` flag is ignored by the template. This is certainly a heuristic, since some templates might only honor it under certain conditions (e.g. the funky Granite 3.x XOR conditions with `documents` and `tools`).
Alright, this is moved to `common/chat.*` now in the function `common_chat_templates_support_enable_thinking`.
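For reference, here is a sketch of roughly what the moved helper looks like, reconstructed from the diff hunk above (the dummy-message setup is an assumption; the render-and-compare heuristic and the raw-pointer parameter are the parts confirmed in this thread):

```cpp
// Render a dummy conversation with enable_thinking off and then on; if the
// flag changes the rendered prompt, the template honors it.
bool common_chat_templates_support_enable_thinking(const common_chat_templates * chat_templates) {
    common_chat_templates_inputs dummy_inputs;
    common_chat_msg msg;
    msg.role    = "user";
    msg.content = "test";
    dummy_inputs.messages = {msg};

    dummy_inputs.enable_thinking = false;
    const auto rendered_no_thinking = common_chat_templates_apply(chat_templates, dummy_inputs);

    dummy_inputs.enable_thinking = true;
    const auto rendered_with_thinking = common_chat_templates_apply(chat_templates, dummy_inputs);

    // Heuristic: templates that only honor the flag under certain conditions
    // (e.g. the Granite 3.x documents/tools XOR cases) may be misdetected.
    return rendered_no_thinking.prompt != rendered_with_thinking.prompt;
}
```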
Merge remote-tracking branch 'origin/master' into gabe-l-hart/thinking-model-disabled-agent-prefill

* origin/master: (84 commits)
  CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
  tests : add --list-ops and --show-coverage options (ggml-org#15745)
  gguf: gguf_writer refactor (ggml-org#15691)
  kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
  model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
  chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
  chat : nemotron thinking & toolcalling support (ggml-org#15676)
  scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
  llama : add support for EmbeddingGemma 300m (ggml-org#15798)
  metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
  llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
  CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
  server: add exceed_context_size_error type (ggml-org#15780)
  Document the new max GPU layers default in help (ggml-org#15771)
  ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
  CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
  opencl: add hs=40 to FA (ggml-org#15758)
  CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
  vulkan: fix mmv subgroup16 selection (ggml-org#15775)
  vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
  ...
refactor: Move logic for detecting template enable_thinking support to common

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <[email protected]>

fix: Use raw pointer for common chat template function

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <[email protected]>
…upport

* origin/master:
  Thinking model disabled assistant prefill (ggml-org#15404)
  Implement --log-colors with always/never/auto (ggml-org#15792)
  CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
  tests : add --list-ops and --show-coverage options (ggml-org#15745)
  gguf: gguf_writer refactor (ggml-org#15691)
  kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
  model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
  chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
  chat : nemotron thinking & toolcalling support (ggml-org#15676)
  scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
  llama : add support for EmbeddingGemma 300m (ggml-org#15798)
Thinking model disabled assistant prefill (#15404)

* feat: Set enable_thinking IFF not disabled and supported
* fix: Fix inverted logic condition for prefill error
* fix: Always parse the enable_thinking kwarg to overwrite the default value

  From what I can tell, this started as a Qwen3-specific keyword, but since the code in `chat.cpp` translates `inputs.enable_thinking` to the right thinking kwarg for the given model, this is now more of a standardized kwarg, so it should always override the default value when sent as part of the chat_template_kwargs field in the API.

* fix: Don't limit template expansion check to jinja

  With the use_jinja check, non-jinja models would enable thinking and always fail assistant prefill.

* feat: Add the error text to json type errors in json_value
* feat: Explicitly reject string values for "enable_thinking"

  There are too many possible "truthy" / "falsy" strings and too many ambiguous strings that don't have a clear truthy/falsy value, so the simplest thing to do here is to reject the request. Ideally, this would be a 422 (Unprocessable Entity), but right now it's coming back as a 500.

* refactor: Move logic for detecting template enable_thinking support to common
* fix: Use raw pointer for common chat template function

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <[email protected]>
Fixes: #15401
cc @matteoserva from the original PR
Changes
- Detect whether `enable_thinking` should be true based on whether the rendered template changes when enabled
- Set `input.enable_thinking` from `chat_template_kwargs.enable_thinking` if set explicitly there
- Ensure `enable_thinking` is true only when set by default for thinking models or explicitly via kwarg

Testing
All model types tested using the following set of calls:

- Thinking model w/ thinking enabled
- Thinking model w/ thinking disabled
- Non-Thinking model:

```sh
./bin/llama-server -m ~/models/granite-3.0-2b-instruct/granite-3.0-2B-instruct-F16.gguf --jinja
```